Abstract. We present a new model for time series classification, called
the hidden-unit logistic model, that uses binary stochastic hidden units
to model latent structure in the data. The hidden units are connected in
a chain structure that models temporal dependencies in the data. Compared
to prior models for time series classification, such as the hidden
conditional random field, our model can represent very complex decision
boundaries because the number of latent states grows exponentially with
the number of hidden units. We demonstrate the strong performance of
our model in experiments on a variety of (computer vision) tasks, including
handwritten character recognition, speech recognition, facial expression
recognition, and action recognition. We also present a state-of-the-art system
for facial action unit detection based on the hidden-unit logistic model.


1 Introduction 

Time series classification is the problem of assigning a single label to a sequence
of observations (i.e., to a time series). Time series classification has a wide range
of applications in computer vision. A state-of-the-art model for the time series
classification problem is the hidden-state conditional random field (HCRF) [22],
which models latent structure in the data using a chain of k-nomial latent
variables. The HCRF has been used successfully in, amongst others, gesture
recognition [?], object recognition [?], and action recognition [?]. An important
limitation of the HCRF is that the number of model parameters grows linearly
with the number of latent states in the model. This implies that the training
of complex models with a large number of latent states is very prone to
overfitting, whilst models with smaller numbers of parameters may be too simple to
represent a good classification function. In this paper, we propose to circumvent
this problem of the HCRF by replacing each of the k-nomial latent variables
by a collection of H binary stochastic hidden units. To keep inference tractable,
the hidden-unit chains are conditionally independent given the time series and
the label. Similar ideas have been explored before in discriminative RBMs [12]
for standard classification problems and in hidden-unit CRFs [?] for sequence
labeling. The binary stochastic hidden units allow the resulting model, which we
call the hidden-unit logistic model (HULM), to represent $2^H$ latent states using
only $O(H)$ parameters. This substantially reduces the amount of data needed to
successfully train models without overfitting, whilst maintaining the ability to
learn complex models with exponentially many latent states. Exact inference in
our proposed model is tractable, which makes parameter learning via (stochastic)
gradient descent very efficient. We show the merits of our hidden-unit logistic
model in experiments on computer-vision tasks ranging from online character
recognition to activity recognition and facial expression analysis. Moreover, we
present a system for facial action unit detection that, with the help of the
hidden-unit logistic model, achieves state-of-the-art performance on a commonly used
benchmark for facial analysis.

The remainder of this paper is organized as follows. Section 2 reviews prior 
work on time series classification. Section 3 introduces our hidden-unit logistic 
model and describes how inference and learning can be performed in the model. 
In Section 4, we present the results of experiments comparing the performance 
of our model with that of state-of-the-art time series classification models on 
a range of classification tasks. In Section 5, we present a new state-of-the-art 
system for facial action unit detection based on the hidden-unit logistic model. 
Section 6 concludes the paper. 

2 Related Work 

There is a substantial amount of prior work on time series classification. Much
of this work is based on the use of (kernels based on) dynamic time warping
(e.g., [?]) or on hidden Markov models (HMMs). The HMM is a generative
model that models the time series data in a chain of latent k-nomial features.
Class-conditional HMMs are commonly combined with class priors via Bayes’
rule to obtain a time series classification model. Alternatively, HMMs are also
frequently used as the base model for Fisher kernels [?], which construct a time
series representation that consists of the gradient a particular time series induces
in the parameters of the HMM; the resulting representations can be used in
standard classifiers such as SVMs. Some recent work has also tried to learn the
parameters of the HMM in such a way as to learn Fisher kernel representations
that are well-suited for nearest-neighbor classification [18]. HMMs have also been
used as the base model for probability product kernels [7], which fit a single HMM
on each time series and define the similarity between two time series as the inner
product between the corresponding HMM distributions. A potential drawback
of these approaches is that they perform classification based on (rather simple)
generative models of the data that may not be well suited for the discriminative
task at hand. By contrast, we opt for a discriminative model that does not waste
model capacity on features that are irrelevant for classification.

In contrast to HMMs, conditional random fields (CRFs; [10]) are discriminative
models that are commonly used for sequence labeling of time series using
so-called linear-chain CRFs. Whilst standard linear-chain CRFs achieve strong
performance on very high-dimensional data (e.g., in natural language processing),
the linear nature of most CRF models limits their ability to learn complex
decision boundaries. Several sequence labeling models have been proposed to
address this limitation, amongst which are latent-dynamic CRFs [20], conditional
neural fields [?], and hidden-unit CRFs [?]. These models introduce stochastic
or deterministic hidden units that model latent structure in the data, allowing
these models to represent nonlinear decision boundaries. As these prior models
were designed for sequence labeling (assigning a label to each frame in the time
series), they cannot readily be used for time series classification (assigning a
single label to the entire time series). Our hidden-unit logistic model may be
viewed as an adaptation of sequence labeling models with hidden units to the
time series classification problem. As such, it is closely related to the hidden CRF
model [22]. The key difference between our hidden-unit logistic model and the
hidden CRF is that our model uses a collection of binary stochastic hidden units
instead of a single k-nomial hidden unit, which allows our model to represent
exponentially more states with the same number of parameters.

An alternative approach to expanding the number of hidden states of the 
HCRF is the infinite HCRF (iHCRF), which employs a Dirichlet process to 
determine the number of hidden states. Inference in the iHCRF can be performed 
via collapsed Gibbs sampling [5] or variational inference [3]. Whilst theoretically
facilitating infinitely many states, the modeling power of the iHCRF is, however, 
limited to the number of “represented” hidden states. Unlike our model, the 
number of parameters in the iHCRF thus still grows linearly with the number 
of hidden states. 


3 Hidden-Unit Logistic Model 

The hidden-unit logistic model is a probabilistic graphical model that receives
a time series as input, and is trained to produce a single output label for this
time series. Like the hidden-state CRF, the model contains a chain of hidden
units that aim to model latent temporal features in the data, and that form the
basis for the final classification decision. The key difference with the HCRF is
that the latent features are modeled by H binary stochastic hidden units, much
like in a (discriminative) RBM. These hidden units $z_t$ can model very rich latent
structure in the data: one may think about them as carving up the data space into
$2^H$ small clusters, all of which may be associated with particular classes. The
parameters of the temporal chains that connect the hidden units may be used
to differentiate between features that are “constant” (i.e., that are likely to be
present for prolonged lengths of time) or that are “volatile” (i.e., that tend to
rapidly appear and disappear). Because the hidden-unit chains are conditionally
independent given the time series and the label, they can be integrated out
analytically when performing inference or learning.

Suppose we are given a time series $x_{1,\ldots,T} = \{x_1, \ldots, x_T\}$ of length T in
which the observation at the t-th time step is denoted by $x_t \in \mathbb{R}^D$. Conditioned
on this time series, the hidden-unit logistic model outputs a distribution over
vectors y that represent the predicted label using a 1-of-K encoding scheme (i.e.,
a one-hot encoding): $\forall k: y_k \in \{0, 1\}$ and $\sum_k y_k = 1$.

Fig. 1. Graphical model of the hidden-unit logistic model.

Denoting the stochastic hidden units at time step t by $z_t \in \{0, 1\}^H$, the
hidden-unit logistic model defines the conditional distribution over label vectors
using a Gibbs distribution in which all hidden units are integrated out:

$$ p(y \mid x_{1,\ldots,T}) = \frac{1}{Z(x_{1,\ldots,T})} \sum_{z_{1,\ldots,T}} \exp\{E(x_{1,\ldots,T}, z_{1,\ldots,T}, y)\}. \tag{1} $$

Herein, $Z(x_{1,\ldots,T})$ denotes a partition function that normalizes the distribution,
and is given by:

$$ Z(x_{1,\ldots,T}) = \sum_{y'} \sum_{z'_{1,\ldots,T}} \exp\{E(x_{1,\ldots,T}, z'_{1,\ldots,T}, y')\}. \tag{2} $$

The energy function of the hidden-unit logistic model is defined as:

$$ E(x_{1,\ldots,T}, z_{1,\ldots,T}, y) = z_1^\top \pi + z_T^\top \tau + c^\top y + \sum_{t=2}^{T} z_{t-1}^\top \mathrm{diag}(A)\, z_t + \sum_{t=1}^{T} \left[ z_t^\top W x_t + z_t^\top V y + z_t^\top b \right]. \tag{3} $$

The graphical model of the hidden-unit logistic model is shown in Figure 1.
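As a concrete reading of the energy function in Eq. (3), the following minimal NumPy sketch evaluates it for one fixed configuration of the hidden units. The function name `hulm_energy` and the array shapes (π, τ, A, b of length H; W of size H×D; V of size H×K; c of length K) are assumptions made for illustration; they are not prescribed by the text.

```python
import numpy as np

def hulm_energy(x, z, y, pi, tau, A, W, V, b, c):
    """Energy of one joint configuration (illustrative sketch).

    x: (T, D) observations, z: (T, H) binary hidden states, y: (K,) one-hot label.
    Assumed shapes: pi, tau, A, b: (H,); W: (H, D); V: (H, K); c: (K,).
    """
    # initial-state, final-state, and label bias terms
    boundary = z[0] @ pi + z[-1] @ tau + c @ y
    # sum_{t=2}^T z_{t-1}^T diag(A) z_t: each unit only interacts with itself over time
    temporal = np.sum(z[:-1] * A * z[1:])
    # sum_t [z_t^T W x_t + z_t^T V y + z_t^T b]
    local = np.sum(z * (x @ W.T + V @ y + b))
    return boundary + temporal + local
```

Because diag(A) is diagonal, hidden unit h at time t−1 is coupled only to the same unit at time t, which is what makes the H chains independent given the data and the label.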


3.1 Inference 

The main inferential problem given an observation $x_{1,\ldots,T}$ is the evaluation of the
predictive distribution $p(y \mid x_{1,\ldots,T})$. The key difficulty in computing this
predictive distribution is the sum over all $2^{H \times T}$ hidden unit states:

$$ M(x_{1,\ldots,T}, y) = \sum_{z_{1,\ldots,T}} \exp\{E(x_{1,\ldots,T}, z_{1,\ldots,T}, y)\}. \tag{4} $$

The chain structure of the hidden-unit logistic model allows us to employ a
standard forward-backward algorithm that can compute $M(\cdot)$ in computational
time linear in T.

Specifically, defining potential functions that contain all terms that involve
time t and hidden unit h:

$$ \Psi_{t,h}(x_t, z_{t-1,h}, z_{t,h}, y) = \exp\{ z_{t-1,h} A_h z_{t,h} + z_{t,h} W_h x_t + z_{t,h} V_h y + z_{t,h} b_h \}, $$

ignoring the remaining bias terms (those involving $\pi$, $\tau$, and $c$), and introducing
virtual hidden units $z_0 = 0$ at time t = 0, we can rewrite $M(\cdot)$ as:

$$ \begin{aligned}
M(\cdot) &= \sum_{z_{1,\ldots,T}} \prod_{t=1}^{T} \prod_{h=1}^{H} \Psi_{t,h}(x_t, z_{t-1,h}, z_{t,h}, y) \\
&= \prod_{h=1}^{H} \Big[ \sum_{z_{1,h},\ldots,z_{T,h}} \prod_{t=1}^{T} \Psi_{t,h}(x_t, z_{t-1,h}, z_{t,h}, y) \Big] \\
&= \prod_{h=1}^{H} \Big[ \sum_{z_{T,h}} \sum_{z_{T-1,h}} \Psi_{T,h}(x_T, z_{T-1,h}, z_{T,h}, y) \sum_{z_{T-2,h}} \Psi_{T-1,h}(x_{T-1}, z_{T-2,h}, z_{T-1,h}, y) \cdots \Big].
\end{aligned} $$

In the above derivation, it should be noted that the product over hidden units
h can be pulled outside the sum over all states $z_{1,\ldots,T}$ because the hidden-unit
chains are conditionally independent given the data and the label y.

Subsequently, the product over time t can be pulled outside the sum because of
the (first-order) Markovian chain structure of the temporal connections between
hidden units.

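In particular, the required quantities can be evaluated using the forward-backward
algorithm, in which we define the forward messages $\alpha_{t,h,k}$ with
$k \in \{0, 1\}$ as:

$$ \alpha_{t,h,k} = \sum_{z_{1,h},\ldots,z_{t-1,h}} \prod_{t'=1}^{t} \Psi_{t',h}(x_{t'}, z_{t'-1,h}, z_{t',h}, y) \Big|_{z_{t,h} = k}, $$

and the backward messages $\beta_{t,h,k}$ as:

$$ \beta_{t,h,k} = \sum_{z_{t+1,h},\ldots,z_{T,h}} \prod_{t'=t+1}^{T} \Psi_{t',h}(x_{t'}, z_{t'-1,h}, z_{t',h}, y) \Big|_{z_{t,h} = k}. $$

These messages can be calculated recursively as follows:

$$ \alpha_{t,h,k} = \sum_{i \in \{0,1\}} \Psi_{t,h}(x_t, z_{t-1,h} = i, z_{t,h} = k, y)\, \alpha_{t-1,h,i}, $$

$$ \beta_{t,h,k} = \sum_{i \in \{0,1\}} \Psi_{t+1,h}(x_{t+1}, z_{t,h} = k, z_{t+1,h} = i, y)\, \beta_{t+1,h,i}. $$

The value of $M(x_{1,\ldots,T}, y)$ can readily be computed from the resulting forward
messages or backward messages:

$$ M(x_{1,\ldots,T}, y) = \prod_{h=1}^{H} \Big( \sum_{k \in \{0,1\}} \alpha_{T,h,k} \Big) = \prod_{h=1}^{H} \Big( \sum_{k \in \{0,1\}} \alpha_{1,h,k}\, \beta_{1,h,k} \Big), $$

where $\alpha_{1,h,k} = \Psi_{1,h}(x_1, z_{0,h} = 0, z_{1,h} = k, y)$ accounts for the first time step.
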
To complete the evaluation of the predictive distribution, we compute the partition
function of the predictive distribution by summing $M(x_{1,\ldots,T}, y)$ over all K
possible labels: $Z(x_{1,\ldots,T}) = \sum_{y'} M(x_{1,\ldots,T}, y')$. Indeed, inference in the
hidden-unit logistic model is linear in both the length of the time series T and in the
number of hidden units H.
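The forward recursion can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it works in the log domain for numerical stability, it folds the initial-state and final-state terms π and τ into the first and last time step (the derivation above sets them aside), and the names `log_M` and `predict_proba` and the parameter shapes are hypothetical.

```python
import numpy as np

def log_M(x, y, pi, tau, A, W, V, b, c):
    """log M(x, y): log of the sum of exp(E) over all 2^(H*T) hidden states.

    x: (T, D), y: (K,) one-hot; pi, tau, A, b: (H,); W: (H, D); V: (H, K); c: (K,).
    """
    T = x.shape[0]
    # unary[t, h] = W_h x_t + V_h y + b_h, the score for switching unit h on at step t
    unary = x @ W.T + V @ y + b
    unary[0] += pi       # assumption: initial-state term folded into the first step
    unary[-1] += tau     # assumption: final-state term folded into the last step
    # log forward messages, row k is z_{t,h} = k; the virtual z_0 = 0 removes the A term at t = 1
    log_alpha = np.stack([np.zeros_like(pi), unary[0]])
    for t in range(1, T):
        off = np.logaddexp(log_alpha[0], log_alpha[1])                # z_t = 0: potential is 1
        on = unary[t] + np.logaddexp(log_alpha[0], log_alpha[1] + A)  # z_t = 1
        log_alpha = np.stack([off, on])
    # the product over the H chains becomes a sum of logs; c^T y does not depend on z
    return np.sum(np.logaddexp(log_alpha[0], log_alpha[1])) + c @ y

def predict_proba(x, params, K):
    """p(y | x) for all K labels, normalizing M over the labels."""
    scores = np.array([log_M(x, np.eye(K)[k], *params) for k in range(K)])
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()
```

The loop visits each time step once and treats all H units with vectorized operations, so the cost per label is proportional to T·H, in line with the linear complexity stated above.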

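Another inferential problem that needs to be solved during parameter learning
is the evaluation of the marginal distribution over a chain edge:

$$ \xi_{t,h,k,l} = P(z_{t,h} = k, z_{t+1,h} = l \mid x_{1,\ldots,T}, y). $$

Using a similar derivation, it can be shown that this quantity can also be
computed from the forward and backward messages:

$$ \xi_{t,h,k,l} = \frac{\alpha_{t,h,k} \cdot \Psi_{t+1,h}(x_{t+1}, z_{t,h} = k, z_{t+1,h} = l, y) \cdot \beta_{t+1,h,l}}{\sum_{k \in \{0,1\}} \alpha_{T,h,k}}. $$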
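The edge marginals can be illustrated for a single hidden-unit chain as follows. This is a sketch under assumptions: the function name `edge_marginals` is hypothetical, the per-step scores `unary[t]` (standing for W_h x_t + V_h y + b_h) are taken as a precomputed input, indices are 0-based, and the messages are not rescaled, so very long series would need a log-domain variant.

```python
import numpy as np

def edge_marginals(unary, a):
    """xi[t, k, l] = P(z_t = k, z_{t+1} = l | x, y) for one hidden-unit chain.

    unary: (T,) scores of switching the unit on at each step; a: transition weight A_h.
    Returns an array of shape (T - 1, 2, 2).
    """
    T = len(unary)
    # psi[t, i, k] = potential for (z_{t-1} = i, z_t = k); it equals 1 whenever z_t = 0
    psi = np.ones((T, 2, 2))
    psi[:, 0, 1] = np.exp(unary)
    psi[:, 1, 1] = np.exp(unary + a)
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = psi[0, 0]                    # virtual z_0 = 0: only row i = 0 contributes
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ psi[t]    # sum over i of alpha[t-1, i] * psi[t, i, k]
    for t in range(T - 2, -1, -1):
        beta[t] = psi[t + 1] @ beta[t + 1]  # sum over i of psi[t+1, k, i] * beta[t+1, i]
    norm = alpha[-1].sum()
    # alpha[t, k] * psi[t+1, k, l] * beta[t+1, l] / norm, for every edge at once
    return alpha[:-1, :, None] * psi[1:] * beta[1:, None, :] / norm
```

Each 2×2 slice of the result sums to one, which gives a quick sanity check when experimenting with the recursion.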


3.2 Parameter Learning 


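Given a training set $\mathcal{D} = \{(x^{(n)}_{1,\ldots,T}, y^{(n)})\}_{n=1,\ldots,N}$ containing N pairs of time
series and their associated labels, we learn the parameters $\Theta = \{\pi, \tau, A, W, V, b, c\}$
of the hidden-unit logistic model by maximizing the conditional log-likelihood
of the training data with respect to the parameters:

$$ \begin{aligned}
\mathcal{L}(\Theta) &= \sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}_{1,\ldots,T}\big) \\
&= \sum_{n=1}^{N} \Big[ \log M\big(x^{(n)}_{1,\ldots,T}, y^{(n)}\big) - \log \sum_{y'} M\big(x^{(n)}_{1,\ldots,T}, y'\big) \Big].
\end{aligned} \tag{6} $$
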


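We augment the conditional log-likelihood with L2-regularization terms on the
parameters A, W, and V. As the objective function is not amenable to closed-form
optimization (in fact, it is not even a convex function), we perform optimization
using stochastic gradient descent on the negative conditional log-likelihood.
The gradient of the conditional log-likelihood with respect to a parameter $\theta \in \Theta$
is given by:

$$ \frac{\partial \mathcal{L}}{\partial \theta} = \mathbb{E}_{P(z_{1,\ldots,T} \mid x_{1,\ldots,T}, y)} \left[ \frac{\partial E(x_{1,\ldots,T}, z_{1,\ldots,T}, y)}{\partial \theta} \right] - \mathbb{E}_{P(z_{1,\ldots,T}, y \mid x_{1,\ldots,T})} \left[ \frac{\partial E(x_{1,\ldots,T}, z_{1,\ldots,T}, y)}{\partial \theta} \right], \tag{7} $$
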

where we omitted the sum over training examples for brevity. The required 
expectations can readily be computed using the inference algorithm described 
in the previous subsection. 
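For the label bias c, the two expectations in Eq. (7) take a particularly simple form: the energy depends on c only through the term c^T y, so its derivative is y, the first expectation equals the observed one-hot label, and the second equals the vector of predictive probabilities. The sketch below illustrates this special case; the function names are hypothetical, and the vector `log_m`, holding log M(x, y') for every label y', is assumed to have been computed by the inference procedure.

```python
import numpy as np

def cond_log_lik(log_m, label):
    """log p(y | x) = log M(x, y) - log sum_{y'} M(x, y'), from precomputed log M values."""
    m = log_m.max()
    return log_m[label] - (m + np.log(np.sum(np.exp(log_m - m))))

def grad_wrt_c(log_m, label):
    """Gradient of the conditional log-likelihood with respect to c: y - p(. | x)."""
    p = np.exp(log_m - log_m.max())
    p /= p.sum()
    y = np.zeros_like(log_m)
    y[label] = 1.0
    return y - p
```

A stochastic gradient step on the negative log-likelihood would then move c in the direction of `grad_wrt_c`, scaled by the learning rate; the gradients of the other parameters additionally require the hidden-unit marginals.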
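For example, defining $\Upsilon(\theta) = z_{t-1,h} A_h z_{t,h} + z_{t,h} W_h x_t + z_{t,h} V_h y + z_{t,h} b_h$ for
notational simplicity, the first expectation can be computed for a parameter $\theta$ of
hidden unit h as follows:

$$ \mathbb{E}_{P(z_{1,\ldots,T} \mid x_{1,\ldots,T}, y)} \left[ \frac{\partial E(x_{1,\ldots,T}, z_{1,\ldots,T}, y)}{\partial \theta} \right] = \sum_{t=1}^{T} \sum_{k \in \{0,1\}} \sum_{l \in \{0,1\}} \xi_{t-1,h,k,l}\, \frac{\partial \Upsilon(\theta)}{\partial \theta} \Big|_{z_{t-1,h} = k,\, z_{t,h} = l}. $$

The second expectation is simply an average of these expectations over all K
possible labels y, weighted by the predictive distribution $p(y \mid x_{1,\ldots,T})$.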

4 Experiments 

To evaluate the performance of the hidden-unit logistic model, we conducted
classification experiments on five different problems involving time series
features: (1) an online handwritten character data set (OHC) [?]; (2) a data set
of Arabic spoken digits (ASD) [5]; (3) the Cohn-Kanade extended facial expression
data set (CK+) [16]; (4) the MSR Action 3D data set (Action) [?]; and
(5) the MSR Daily Activity 3D data set (Activity) [26]. The five data sets are
introduced in Section 4.1, the experimental setup is presented in Section 4.2, and
the results of the experiments are in Section 4.3.

4.1 Data Sets 

The online handwritten character dataset [?] is a pen-trajectory time series data
set that consists of three dimensions at each time step, viz., the pen movement in 
the x-direction and y-direction, and the pen pressure. The data set contains 2858 
time series with an average length of 120 frames. Each time series corresponds 
to a single handwritten character that has one of 20 labels. We pre-process the 
data by windowing the features of 10 frames into a single feature vector with 30 
dimensions. 

The Arabic spoken digit dataset contains 8800 utterances [5], which were
collected by asking 88 native Arabic speakers to utter all 10 digits ten times. Each
time series consists of 13-dimensional MFCCs, which were sampled at 11,025 Hz,
16 bits, using a Hamming window. We enrich the features by windowing 3 frames
into 1 frame, resulting in 13 × 3 dimensions for each frame of the features
while keeping the same length of time series.
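One way to realize this kind of length-preserving windowing is to concatenate every frame with its neighbours. The sketch below assumes a centered window and pads the two ends by repeating the first and last frame; neither choice is stated in the text, and the function name `stack_window` is hypothetical.

```python
import numpy as np

def stack_window(frames, width=3):
    """Concatenate each frame with its neighbours, keeping the series length.

    frames: (T, D) array -> (T, D * width) array. Edge frames are padded by
    repeating the first/last frame (an assumption made for this sketch).
    """
    half = width // 2
    padded = np.pad(frames, ((half, width - 1 - half), (0, 0)), mode="edge")
    # column block i holds the frame at offset i - half relative to the current one
    return np.hstack([padded[i:i + len(frames)] for i in range(width)])
```

Applied to a series of 13-dimensional MFCC frames with `width=3`, this yields 39-dimensional frames and leaves the number of frames unchanged.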

The Cohn-Kanade extended facial expression data set [16] contains 593 image
sequences (videos) from 123 subjects. Each video shows a single facial
expression. The videos have an average length of 18 frames. A subset of 327 of
the videos, which have a validated label corresponding to one of seven emotions
(anger, contempt, disgust, fear, happiness, sadness, and surprise), is used in
our experiments. We adopt the publicly available shape features used in [17] as
the feature representation for our experiments. These features represent each
frame by the variation of 68 feature point locations (x, y) with respect to the
first frame [17], which leads to a 136-dimensional feature representation for each
frame in the video.

The MSR Action 3D data set [?] consists of RGB-D videos of people performing
certain actions. The data set contains 567 videos with an average length
of 41 frames. Each video should be classified into one of 20 actions such as “high
arm wave”, “horizontal arm wave”, and “hammer”. We use the real-time skeleton
tracking algorithm of [21] to extract the 3D joint positions from the depth
sequences. We use the 3D joint position features (pairwise relative positions)
proposed in [26] as the feature representation for the frames in the videos. Since
we track a total of 20 joints, the dimensionality of the resulting feature
representation is $3 \times \binom{20}{2} = 570$, where $\binom{20}{2} = 190$ is the number of pairwise
distances between joints and 3 is the dimensionality of the (x, y, z) coordinate vectors.
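The dimensionality count can be checked with a small sketch that builds pairwise relative positions for a single frame. The function name and the ordering of the joint pairs are assumptions made for illustration; the original features may order or normalize the pairs differently.

```python
import numpy as np

def pairwise_joint_features(joints):
    """joints: (J, 3) array of 3D joint positions -> flat vector of length 3 * J * (J - 1) / 2."""
    i, j = np.triu_indices(len(joints), k=1)   # all unordered joint pairs (i < j)
    return (joints[i] - joints[j]).ravel()     # relative (x, y, z) offset per pair
```

For 20 tracked joints there are 190 pairs, so the resulting vector has 570 entries per frame.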

The MSR Daily Activity 3D data set [26] contains RGB-D videos of people
performing daily activities. The data set also contains 3D skeletal joint positions, 
which are extracted using the Kinect SDK. The videos need to be classified into 
one of 16 activity types, which include “drinking”, “eating”, “reading book”, 
etc. Each activity is performed by 10 subjects in two different poses (namely, 
while sitting on a sofa and while standing), which leads to a total of 320 videos. 
The videos have an average length of 193 frames. To represent each frame, we 
extract 570-dimensional 3D joint position features. 


4.2 Experimental Setup 

In our experiments, the model parameters A, W, V of the hidden-unit logistic
model were initialized by sampling them from a Gaussian distribution with
a variance of 10“^. The initial-state parameter π, the final-state parameter τ, and
the bias parameters b, c were initialized to 0. Training of our model is performed
using a standard stochastic gradient descent procedure; the learning rate
is decayed during training. We set the number of hidden units H to 100. The
L2-regularization parameter λ was tuned by minimizing the error on a small
validation set.

We compare the performance of our hidden-unit logistic model with that of
three other time series classification models: (1) the naive logistic model shown
in Figure 2, (2) the popular HCRF model [22], and (3) the Fisher kernel learning
model [18]. Details of these models are given below.


Naive logistic model. The naive logistic model is a linear logistic model that
shares parameters between all time steps, and makes a prediction by summing
(or equivalently, averaging) the inner products between the model weights and
feature vectors over time before applying the softmax function. Specifically, the
naive logistic model defines the following conditional distribution over the label
y given the time series data $x_{1,\ldots,T}$:

$$ p(y \mid x_{1,\ldots,T}) = \frac{\exp\{E(x_{1,\ldots,T}, y)\}}{\sum_{y'} \exp\{E(x_{1,\ldots,T}, y')\}}, $$

where the energy function is defined as

$$ E(x_{1,\ldots,T}, y) = \sum_{t=1}^{T} \left( y^\top W x_t \right) + c^\top y. $$

Fig. 2. Graphical model of the naive logistic model.

The corresponding graphical model is shown in Figure 2. We include the naive
logistic model in our experiments to investigate the effect of adding hidden units
to models that average energy contributions over time.
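The naive logistic model is small enough to write out in full. The following sketch assumes that W has one row per class (shape K×D) and uses the hypothetical name `naive_logistic_proba`; it evaluates the energy for every one-hot label at once and applies a softmax.

```python
import numpy as np

def naive_logistic_proba(x, W, c):
    """p(y | x) for the naive logistic model; x: (T, D), W: (K, D), c: (K,)."""
    # for the one-hot label k, the energy is sum_t W_k x_t + c_k
    energies = W @ x.sum(axis=0) + c
    energies -= energies.max()          # stabilize the softmax
    p = np.exp(energies)
    return p / p.sum()
```

Because the frames enter only through their sum, shuffling the time steps leaves the prediction unchanged, which makes explicit that this baseline ignores temporal structure.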


Hidden CRF. The hidden-state CRF’s graphical model is very similar to that
of the hidden-unit logistic model [22]. The key difference between the two models
is in the way the hidden units are defined: whereas the hidden-unit logistic
model uses a substantial number of binary stochastic hidden units to represent
the latent state, the HCRF uses a single multinomial unit (much like a
hidden Markov model). We performed experiments using the hidden CRF
implementation of [?], which learns the parameters of the model using L-BFGS.
Following [22], we trained HCRFs with 10 latent states on all data sets. (We
found it was computationally infeasible to train HCRFs with more than 10
latent states.) We tune the L2-regularization parameter of the HCRF on a small
validation set.


Fisher kernel learning. In addition to comparing with HCRFs, we compare the
performance of our model with that of the recently proposed Fisher kernel learning
(FKL) model [18]. We selected the FKL model for our experiments because
[18] reports strong performance on a range of time series classification problems.
We trained FKL models based on hidden Markov models with 10 hidden states
(the number of hidden states was set identical to that of the hidden CRF).
Subsequently, we computed the Fisher kernel representation and trained a linear
SVM on the resulting features to obtain the final classifier. The slack parameter
C of the SVM is tuned on a small validation set.

Fig. 3. (a) Generalization error (%) of the hidden-unit logistic model on the Arabic
speech data set as a function of the number of hidden units. (b) Generalization error
(%) of the hidden-unit logistic model and the hidden CRF on the CK+ data set as a
function of the number of hidden parameters.


4.3 Results 

We perform two sets of experiments with the hidden-unit logistic model: (1) a
set of experiments in which we evaluate the performance of the model (and of 
the hidden CRF) as a function of the number of hidden units and (2) a set of 
experiments in which we compare the performance of all models on all data sets. 
The two sets of experiments are described separately below. 

Effect of Varying the Number of Hidden Units. We first conduct experiments
on the ASD data set to investigate the performance of the hidden-unit
logistic model as a function of the number of hidden units. The results of these
experiments are shown in Figure 3(a). The results presented in the figure show
that the error initially decreases when the number of hidden units increases,
because adding hidden units adds complexity to the model that allows it to better
fit the data. However, as the number of hidden units increases further, the model
starts to overfit on the training data despite the use of L2-regularization.

We performed a similar experiment on the CK+ facial expression data set,
in which we also performed comparisons with the hidden CRF for a range of
values for the number of hidden states. Figure 3(b) presents the results of these
experiments. On the CK+ data set, there are no large fluctuations in the errors
of the HULM as the number of hidden parameters increases. The figure also shows
that the hidden-unit logistic model outperforms the hidden CRF irrespective of
the number of hidden units. For instance, a hidden-unit logistic model with 10
hidden units outperforms even a hidden CRF with 100 hidden parameters. This
result illustrates the potential merits of using models in which the number of
latent states grows exponentially with the number of parameters.

Comparison with Modern Time Series Classifiers. In a second set of
experiments, we compare the performance of the hidden-unit logistic model with
that of the naive logistic model, Fisher kernel learning, and the hidden CRF on
all five data sets. In our experiments, the number of hidden units in the
hidden-unit logistic model was set to 100; following [22], the hidden CRF used 10 latent
states. The results of our experiments are presented in Table 1 and are discussed
for each data set separately below.

Table 1. Generalization errors (%) on all five data sets by four time series classification
models: the naive logistic model (NL), Fisher kernel learning (FKL), the hidden CRF
(HCRF), and the hidden-unit logistic model (HULM). The best performance on each
data set is boldfaced. See text for details.

                                  Model
Dataset       Dim.    Classes     NL       FKL      HCRF     HULM
OHC           3x10    20          23.67     0.97     1.58     1.30
ASD           13x3    10          25.50     6.91     3.68     4.68
CK+           136     7            9.20    10.81    11.04     6.44
Action        570     20          40.40    40.74    34.68    35.69
Activity      570     16          59.38    43.13    62.50    45.63
Avg. error    -       -           31.63    20.51    22.70    18.75
Avg. rank     -       -            3.2      2.4      2.6      1.8

Online handwritten character dataset (OHC). Following the experimental setup
in [18], we measure the generalization error of all our models on the online
handwritten character dataset using 10-fold cross-validation. The average
generalization error of each model is shown in Table 1. Whilst the naive logistic model
performs very poorly on this data set, all three other methods achieve very low
error rates. The best performance is obtained by FKL, but the differences between
the models are very small on this data set, presumably due to a ceiling effect.

Arabic spoken digits dataset (ASD). Following [?], the error rates for the Arabic
spoken digits data set in Table 1 were measured using a fixed training/test
division: 75% of the samples are used for training and the remaining 25% compose
the test set. The best performance on this data set is obtained by the hidden CRF
model (3.68%), whilst our model has a slightly higher error of 4.68%, which
in turn is better than the performance of FKL. It should be noted that the
performance of the hidden CRF and the hidden-unit logistic model are better
than the error rate of 6.88% reported in [?] (on the same training/test division).

Facial expression dataset (CK+). Table 1 presents generalization errors measured
using 10-fold cross-validation. Folds are constructed in such a way that all
videos by the same subject are in the same fold (the subjects appearing in test
videos were not present in the training set). On the CK+ data set, the hidden-unit
logistic model substantially outperforms the hidden CRF model, obtaining
an error of 6.44%. Somewhat surprisingly, the naive logistic model also
outperforms the hidden CRF model with an error of 9.20%. A possible explanation
for this result is that classifying these data successfully does not require
exploitation of temporal structure: many of the expressions can also be recognized
well from a single frame. As a result, the naive logistic model may perform well
even though it simply averages over time. This result also suggests that the
hidden CRF model may perform poorly on high-dimensional data (the CK+ data








12 


Time Series Classification using the Hidden-Unit Logistic Model 


handwritten character data set (3-dimensional) and the Arabic spoken data set 
(13-dimensional). 

MSR Action 3D data set (Action). To measure the generalization error of the
time series classification models on the MSR Action 3D dataset, we followed the
experimental setup of [26]: we used all videos of five subjects for training,
and used the videos of the remaining five subjects for testing. Table 1 presents
the average generalization error on the videos of the five test subjects. The four
models perform quite similarly, although the hidden CRF and the hidden-unit
logistic model do appear to outperform the other two models somewhat.

MSR Daily Activity 3D data set (Activity). On the MSR Daily Activity data
set, we use the same experimental setup as on the action data set: five subjects
are used for training and five for testing. The results in Table 1 show that the
hidden-unit logistic model substantially outperforms the hidden CRF on this
challenging data set (but FKL performs slightly better).

In terms of the average error rate and average rank over all data sets, the
hidden-unit logistic model performs very strongly. Indeed, it substantially
outperforms the hidden CRF model, which illustrates that using a collection of
(conditionally independent) hidden units may be a more effective way to represent
latent states than a single multinomial unit. FKL also performs quite well
in our experiments, although its performance is slightly worse than that of the
hidden-unit logistic model. However, it should be noted here that FKL scales
poorly to large data sets: its computational complexity is quadratic in the number
of time series, which limits its applicability to relatively small data sets (with
fewer than, say, 10,000 time series). By contrast, the training of hidden-unit
logistic models scales linearly in the number of time series and, moreover, can be
performed using stochastic gradient descent.
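The summary rows of Table 1 can be recomputed from the per-data-set errors with a few lines of NumPy. The snippet below is only an illustrative check; it assumes that rank 1 is assigned to the lowest error on each data set, and it relies on the fact that no ties occur in the table.

```python
import numpy as np

models = ["NL", "FKL", "HCRF", "HULM"]
errors = np.array([
    [23.67,  0.97,  1.58,  1.30],   # OHC
    [25.50,  6.91,  3.68,  4.68],   # ASD
    [ 9.20, 10.81, 11.04,  6.44],   # CK+
    [40.40, 40.74, 34.68, 35.69],   # Action
    [59.38, 43.13, 62.50, 45.63],   # Activity
])

avg_error = errors.mean(axis=0)
# rank 1 = lowest error on a data set; double argsort converts errors to ranks
ranks = errors.argsort(axis=1).argsort(axis=1) + 1
avg_rank = ranks.mean(axis=0)

for name, e, r in zip(models, avg_error, avg_rank):
    print(f"{name}: average error {e:.2f}, average rank {r:.1f}")
```

The resulting averages agree with the last two rows of Table 1, with the HULM obtaining both the lowest average error and the best average rank.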

5 Application to Facial AU Detection 

In this section, we present a system for facial action unit (AU) detection that is
based on the hidden-unit logistic model. We evaluate our system on the
Cohn-Kanade extended facial expression database (CK+) [16], evaluating its ability
to detect 10 prominent facial action units: namely, AU1, AU2, AU4, AU5, AU6,
AU7, AU12, AU15, AU17, and AU25. We compare the performance of our facial
action unit detection system with that of state-of-the-art systems for this
problem. Before describing the results of these experiments, we first describe the
feature extraction of our AU detection system and the setup of our experiments.

5.1 Facial Features 

We extract two types of features from the video frames in the CK+ data set:
(1) shape features and (2) appearance features. Our features are identical to the
features used by the system described in [17]; the features are publicly available
online. For completeness, we briefly describe both types of features below.

The shape features represent each frame by the vertical/horizontal displacements
of facial landmarks with respect to the first frame. To this end, 68 automatically
detected/tracked landmarks are used to form 136-dimensional time
series. All landmark displacements are normalized by removing rigid
transformations (translation, rotation, and scale).

The appearance features are based on grayscale intensity values. To capture
the change in facial appearance, face images are warped onto a base shape,
where feature points are in the same location for each face. After this shape
normalization procedure, the grayscale intensity values of the warped faces can be
readily compared. The final appearance features are extracted by subtracting the
warped textures from the warped texture in the first frame. The dimensionality
of the appearance feature vectors is reduced using principal components analysis
so as to retain 90% of the variance in the data. This leads to 439-dimensional
appearance feature vectors, which are combined with the shape features to form
the final feature representation for the video frames. For further details on the
feature extraction, we refer to [17].

5.2 Experimental Setup 

To gauge the effectiveness of the hidden-unit logistic model in facial AU
detection, we performed experiments on the CK+ database [16]. The database consists
of 593 image sequences (videos) from 123 subjects with an average length of 18.1
frames. The videos show expressions from neutral face to peak formation, and
include annotations for 30 action units. In our experiments, we only consider the
10 most frequent action units.

Our AU detection system employs 10 separate binary classifiers for detecting 
action units in the videos. In other words, we train a separate HULM for each facial
action unit. An individual model thus distinguishes between the presence and
absence of the corresponding action unit. We use a 10-fold cross-validation
scheme to measure the performance of the resulting AU detection system: we
randomly select one test fold containing 10% of the videos, and use the remaining
nine folds to train the system. The folds are constructed such that there
is no subject overlap between folds: i.e., subjects appearing in the test data were 
not present in the training data. 
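
A subject-disjoint fold assignment of this kind could be sketched as follows. This is an assumption about one reasonable construction, not the procedure actually used: subjects are shuffled and then greedily placed in the currently smallest fold, so each fold holds roughly (not exactly) 10% of the videos. The subject labels below are synthetic and only reproduce the counts of 123 subjects and 593 videos.

```python
import random
from collections import defaultdict

def subject_disjoint_folds(subject_ids, n_folds=10, seed=0):
    """Return one fold index per video such that all videos of a subject
    end up in the same fold."""
    videos_by_subject = defaultdict(list)
    for video_index, subject in enumerate(subject_ids):
        videos_by_subject[subject].append(video_index)
    subjects = list(videos_by_subject)
    random.Random(seed).shuffle(subjects)
    # Stable sort: subjects with many videos first, ties stay shuffled.
    subjects.sort(key=lambda s: len(videos_by_subject[s]), reverse=True)
    fold_sizes = [0] * n_folds
    fold_of_video = [None] * len(subject_ids)
    for subject in subjects:
        fold = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        for video_index in videos_by_subject[subject]:
            fold_of_video[video_index] = fold
        fold_sizes[fold] += len(videos_by_subject[subject])
    return fold_of_video

def train_test_split(fold_of_video, test_fold):
    """Indices of training and test videos for a given test fold."""
    test = [i for i, f in enumerate(fold_of_video) if f == test_fold]
    train = [i for i, f in enumerate(fold_of_video) if f != test_fold]
    return train, test

# Synthetic subject labels: every subject has at least one video.
rng = random.Random(1)
subject_ids = list(range(123)) + [rng.randrange(123) for _ in range(593 - 123)]
folds = subject_disjoint_folds(subject_ids)
train, test = train_test_split(folds, test_fold=0)
```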


5.3 Results 

We ran experiments using the HULM on three feature sets: (1) shape features, 
(2) appearance features, and (3) a concatenation of both feature vectors. We 
measure the performance of our system using the area under the ROC curve (AUC).
Table 2 shows the results for the HULM and for the baseline in [17]. The results
show that the HULM outperforms the CRF baseline of [17], with our best model
achieving an AUC that is approximately 0.03 higher than the best result of [17].
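
For reference, the AUC of a binary detector can be computed from its scores with the rank-sum (Mann-Whitney U) formulation. The sketch below is a generic implementation, not the evaluation code used in the experiments. The function name `roc_auc` is illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum statistic.

    labels: list of 0/1 ground-truth values (1 = action unit present).
    scores: list of classifier scores, higher meaning more confident presence.
    Tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Three of the four (positive, negative) pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```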
To gain insight into which features are modeled by the HULM hidden units,
we visualized a single column of W in Figure 4 for the AU4 and AU25 models
that were trained on appearance features. Specifically, we selected the hidden 

Table 2. AUC of the HULM and the CRF baseline in [17] for three feature sets.
*In [17], the combined feature set also includes SIFT features.

            Feature Set
Method      Shape     Appearance   Combination
HULM        0.9101    0.9197       0.9253
[17]        0.8902    0.8971       0.8647*




Fig. 4. Visualization of W for (a) AU4 and (b) AU25. Brighter colors correspond to
image regions with higher weights.


Table 3. Average F1-scores of our system and seven state-of-the-art systems on the
CK+ data set. The F1 scores for all methods were obtained from the literature. Note
that the averages are not over the same AUs, and cannot readily be compared. The
best performance for each condition is boldfaced.

AU      HULM    [?]     [25]    [14]    [15]    [?]     [30]
1       0.91    0.87    0.83    0.66    0.78    0.76    0.88
2       0.85    0.90    0.83    0.57    0.80    0.76    0.92
4       0.76    0.73    0.63    0.71    0.77    0.79    0.89
5       0.63    0.80    0.60    -       0.64    -       -
6       0.69    0.80    0.80    0.94    0.77    0.70    0.93
7       0.57    0.47    0.29    0.87    0.62    0.63    -
12      0.88    0.84    0.84    0.88    0.90    0.87    0.90
15      0.72    0.70    0.36    0.84    0.70    0.71    0.73
17      0.89    0.76    -       0.79    0.81    0.86    0.76
25      0.96    0.96    0.75    -       0.88    -       0.73
Avg.    0.79    0.78    0.66    0.78    0.77    0.76    0.84


unit with the highest corresponding V-value for visualization, as this hidden unit 
apparently models the most discriminative features. The figure shows that the 
appearance of the eyebrows is most important in the AU4 model (brow lowerer), 
whereas the mouth region is most important in the AU25 model (lips part). 

In Table 3, we compare the performance of our AU detection system with
that of seven other state-of-the-art systems in terms of the more commonly used
F1-score. (Please note that the averages are not over the same AUs, and cannot
readily be compared.) The results in the table show that our system achieves
the best F1 scores for AU1, AU17, and AU25. It performs very strongly on most
of the other AUs, illustrating the potential of the hidden-unit logistic model.
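
The caveat about the averages can be made concrete with a short sketch. It computes the F1-score from binary predictions and then averages per-AU scores while skipping AUs for which a system reports no result. The two score lists are the HULM and [30] columns of Table 3. The helper names are illustrative and not part of the original system.

```python
def f1_score(labels, predictions):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2.0 * tp / (2.0 * tp + fp + fn)

def average_over_reported(scores):
    """Mean of the per-AU scores, ignoring AUs without a reported value."""
    reported = [s for s in scores if s is not None]
    return sum(reported) / len(reported)

# Per-AU F1 scores for AUs 1, 2, 4, 5, 6, 7, 12, 15, 17, 25 (from Table 3).
hulm = [0.91, 0.85, 0.76, 0.63, 0.69, 0.57, 0.88, 0.72, 0.89, 0.96]
ref30 = [0.88, 0.92, 0.89, None, 0.93, None, 0.90, 0.73, 0.76, 0.73]

hulm_avg = average_over_reported(hulm)    # average over all 10 AUs
ref30_avg = average_over_reported(ref30)  # average over only 8 AUs
```

The second average omits AU5 and AU7, two of the AUs with the lowest scores across systems, which is one reason the averages in the table are not directly comparable.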

6 Conclusions

In this paper, we presented the hidden-unit logistic model (HULM), a new model 
for the single-label classification of time series. The model is similar in structure 
to the popular hidden CRF model, but it employs binary stochastic hidden 
units instead of multinomial hidden units between the data and label. As a 
result, the HULM can model exponentially more latent states than a hidden 
CRF with the same number of parameters. The results of our experiments with 
HULM on several real-world datasets show that this may result in improved 
performance on challenging time-series classification tasks. In particular, the 
HULM performs very competitively on complex computer-vision problems such 
as facial expression recognition. 
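
The counting argument behind this claim can be illustrated with a few lines of arithmetic. The sketch assumes, as a simplification, that the parameter count grows linearly with the number of latent states in a hidden CRF and with the number of hidden units in the HULM, so that both models are compared at the same "budget". The function name is hypothetical.

```python
def joint_latent_states(n_variables, states_per_variable):
    """Number of joint configurations of the latent variables at one time step."""
    return states_per_variable ** n_variables

for budget in (5, 10, 20):
    hidden_crf = joint_latent_states(1, budget)  # one variable with `budget` states
    hulm = joint_latent_states(budget, 2)        # `budget` binary hidden units
    print(f"budget={budget:2d}  hidden CRF states={hidden_crf:2d}  HULM states={hulm}")
```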

In future work, we aim to explore more complex variants of our hidden-unit 
logistic model. In particular, we intend to study variants of the model in which 
the simple first-order Markov chains on the hidden units are replaced by more 
powerful, higher-order temporal connections. Specifically, we intend to implement
the higher-order chains via a factorization similar to that used in neural autoregressive
distribution estimators [?]. The resulting models will likely have longer
temporal memory than our current model, which will likely lead to stronger 
performance on complex time series classification tasks. A second direction for 
future work we intend to explore is an extension of our model to multi-task 
learning. Specifically, we will explore multi-task learning scenarios in which sequence
labeling and time series classification are performed simultaneously (for
instance, simultaneous recognition of short-term actions and long-term activities,
or simultaneous optical character recognition and word classification). By performing
sequence labeling and time series classification based on the same latent
features, the performance on both tasks may be improved because information 
is shared in the latent features. 